Lab 3: Modeling correlation and regression

Practice session covering topics discussed in Lecture 1

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

The examples and code from this lab session follow very closely the book:

Topics discussed in Lecture # 3

understand

Lecture 3: topics

  • Testing for a correlation hypothesis (relationship of variables)
    • Pearson rho analysis (param)
    • Spearman test (no param)
  • Measures of association
    • Fisher’s Exact Test
    • Chi-Square Test of Independence
  • From correlation/association to causation
    • introduction to experiments
      • Example: Linear regression models
      • Example: Multiple Linear Regression
  • From causation to prediction
    • introduction to Machine Learning
      • Supervised algorithms
      • Unsupervised algorithms

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# General 
library(fs)      # file/directory interactions
library(here)    # tools find your project's files, based on working directory
library(janitor) # tools for examining and cleaning data
library(dplyr)   # {tidyverse} tools for manipulating and summarizing tidy data 
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting

# Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(tidymodels) # not installed on this machine
# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggpubr) # 'ggplot2' Based Publication Ready Plots

# Data 
# devtools::install_github("OI-Biostat/oi_biostat_data")
#library(oibiostat) 
# load some colors
colors <- readRDS(here::here("practice", "data_input", "03_datasets","colors.rds"))

DATASETS for today

Importing Dataset 1 (NHANES)

  • Adapting the function here to match your own folder structure

Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Sterted in the 1960s, it became a continuous program in 1999. Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
nhanes_samp <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "nhanes_samp.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- NHANES Variables and their description

Variable Type Description
X int xxxx
ID int xxxxx
SurveyYr chr yyyy_mm. Ex. 2011_12
Gender chr Gender (sex) of study participant coded as male or female
Age int ##
AgeDecade chr yy-yy es 20-29
Race1 chr Reported race of study participant: Mexican, Hispanic, White, Black, or Other
Race3 chr Reported race of study participant... Not availale for 2009-10
Education chr [>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
MaritalStatus chr [>= 20 yro]. Ex. Married, Widowed, Divorced, Separated, NeverMarried, or LivePartner
HHIncome chr Total annual gross income for the household in US dollars
HHIncomeMid int Numerical version of HHIncome derived from the middle income in each category. Ex. 12500 40000
Poverty dbl A ratio of family income to poverty guidelines. Smaller numbers indicate more poverty Ex.. 0.95 1.74 4.99
HomeRooms int How many rooms are in home of study participant (counting kitchen but not bath room).
HomeOwn chr One of Home, Rent, or Other
Work chr NotWorking Working
Weight dbl Weight in kg
Height dbl Standing height in cm. Reported for participants aged 2 years or older.
BMI dbl Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse int 60 second pulse rate
BPSysAve int Combined systolic blood pressure reading, following the procedure outlined for BPXSAR
BPDiaAve int Combined diastolic blood pressure reading, following the procedure outlined for BPXDAR
BPSys1 int Systolic blood pressure in mm Hg – first reading
BPDia1 int Diastolic blood pressure in mm Hg – second reading (consecutive readings)
BPSys2 int Systolic blood pressure in mm Hg – second reading (consecutive readings)
BPDia2 int Diastolic blood pressure in mm Hg – second reading
BPSys3 int Systolic blood pressure in mm Hg third reading (consecutive readings)
BPDia3 int Diastolic blood pressure in mm Hg – third reading (consecutive readings)
Testosterone dbl Testerone total (ng/dL). Reported for participants aged 6 years or older. Not available for 2009-2010
DirectChol dbl Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol dbl Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
UrineVol1 int Urine volume in mL – first test. Reported for participants aged 6 years or older
UrineFlow1 dbl Urine flow rate in mL/min – first test. Reported for participants aged 6 years or older
UrineVol2 int Urine volume in mL – second test
UrineFlow2 dbl Urine flow rate (urine volume/time since last urination) in mL/min – second test
Diabetes chr Study participant told by a doctor or health professional that they have diabetes
DiabetesAge int Age of study participant when first told they had diabetes
HealthGen chr Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor Fair
DaysPhysHlthBad int Self-reported # of days participant’s physical health was not good out of the past 30 days
DaysMentHlthBad int Self-reported # of days participant’s mental health was not good out of the past 30 days
LittleInterest chr Self-reported # of days where participant had little interest in doing things. Among: None, Several, Majority, or AlmostAll
Depressed chr Self-reported # of days where participant felt down, depressed or hopeless. Among: None, Several, Majority, or AlmostAll
nPregnancies int # times participant has been pregnant
nBabies int # deliveries resulted in live births
PregnantNow chr Pregnancy status ascertained for females 8-59 years of age
Age1stBaby int Age of participant at time of first live birth
SleepHrsNight int Self-reported # of hours study participant gets at night on weekdays or workdays. For participants aged 16 years and older
SleepTrouble chr Participant [16 years and older] has had trouble sleeping. Coded as Yes or No.
PhysActive chr Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
PhysActiveDays int Number of days in a typical week that participant does moderate or vigorous intensity activity.
TVHrsDay chr Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDay chr Number of hours per day on average participant used a computer or gaming device over the past 30 day
TVHrsDayChild lgl [2-11 yro] Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDayChild lgl [2-11 yro] Number of hours per day on average participant used a computer or gaming device over the past 30 day
Alcohol12PlusYr chr Participant has consumed at least 12 drinks of any type of alcoholic beverage in any one year
AlcoholDay int Average number of drinks consumed on days that participant drank alcoholic beverages
AlcoholYear int [>+ 18yro] Estimated number of days over the past year that participant drank alcoholic beverages
SmokeNow chr Study participant currently smokes cigarettes regularly. (Yes or No)
Smoke100 chr Study participant has smoked at least 100 cigarettes in their entire life. (Yes pr No)
Smoke100n chr Smoker Non-Smoker
SmokeAge int Age study participant first started to smoke cigarettes fairly regularly
Marijuana chr Participant has tried marijuana
AgeFirstMarij int Age Participant has tried marijuana first
RegularMarij chr Participant has been/is a regular marijuana user (used at least once a month for a year) (Yes or No)
AgeRegMarij int Age of participant when first started regularly using marijuana
HardDrugs chr Participant has tried cocaine, crack cocaine, heroin or methamphetamine (Yes or No)
SexEver chr Participant had had sex (Yes or No)
SexAge int Age Participant had had sex first time
SexNumPartnLife int Number of opposite sex partners participant has had
SexNumPartYear int Number of opposite sex partners over the past 12 months
SameSex chr Participant has had any kind of sex with a same sex partne(Yes or No)
SexOrientation chr Participant’s sexual orientation One of Heterosexual, Homosexual, Bisexual

Importing Dataset 2 (PREVEND)

Data from the Prevention of Renal and Vascular End-stage Disease (PREVEND) study, which took place in the Netherlands. The study collected various demographic and cardiovascular risk factors. This dataset is from the third survey, which participants completed in 2003-2006; data is provided for 4,095 individuals who completed cognitive testing.

Name: PREVEND (Prevention of REnal and Vascular END-stage Disease) is a study which took place in the Netherlands on 8,592 participants aged 28-75, with subsequent follow-ups in 1997-1998. A second survey was conducted in 2001-2003 on 6,894 participants, 5,862 completed the third survey in 2003-2006 (here measurement of cognitive function was added to the study protocol).
Documentation: dataset2
Sampling details: Data from 4,095 individuals who completed cognitive testing are in the prevend dataset, available in the R package oibiostat.

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
prevend <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "prevend.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- PREVEND Variables and their description

Variable Description
... ...

Importing Dataset 3 (FAMuSS)

Name: FAMuSS (Functional SNPs Associated with Muscle Size and Strength) examine the association of demographic, physiological and genetic characteristics with muscle strength – including data on race and genotype at a specific locus on the ACTN3 gene (the “sports gene”).
Documentation: dataset3
Sampling details: the DATASET includes 595 observations on 9 variables

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
famuss <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "famuss.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

- FAMuSS Variables and their description

Variable Description
X id
ndrm.ch Percent change in strength in the non-dominant arm
drm.ch Percent change in strength in the dominant arm
sex Sex of the participant
age Age in years
race Recorded as African Am (African American), Caucasian, Asian, Hispanic, Other
height Height in inches
weight Weight in pounds
actn3.r577x Genotype at the location r577x in the ACTN3 gene.
bmi Body Mass Index

CORRELATION

Explore relationships between two variables

Approaches for summarizing relationships between two variables vary depending on variable types…

  • Two numerical variables
  • Two categorical variables
  • One numerical variable and one categorical variable

Two variables \(x\) and \(y\) are

  • positively associated if \(y\) increases as \(x\) increases.
  • negatively associated if \(y\) decreases as \(x\) increases.

TWO NUMERICAL VARIABLES (NHANES)

Two numerical variables (plot)

Height and weight (taken from the nhanes_samp dataset) are positively associated.

  • notice we can also use the generic function base::plot for a simple scatter plot
# rename for convenience
nhanes <- nhanes_samp %>% 
  janitor::clean_names()

# basis plot 
plot(nhanes$height, nhanes$weight,
     xlab = "Height (cm)", ylab = "Weight (kg)", cex = 0.8)  

Two numerical variables (plot)

Two numerical variables: correlation (with stats::cor)

Correlation is a numerical summary that measures the strength of a linear relationship between two variables.

  • The correlation coefficient \(r\) takes on values between \(-1\) and \(1\).

  • The closer \(r\) is to \(\pm 1\), the stronger the linear association.

  • Here we compute the Pearson rho (parametric), with the function cor

    • notice the use argument to choose how to deal with missing values (in this case only using all complete pairs)
is.numeric(nhanes$height) 
[1] TRUE
is.numeric(nhanes$weight)
[1] TRUE
# using `stats` package
cor(x = nhanes$height, y =  nhanes$weight, 
    # argument for dealing with missing values
    use = "pairwise.complete.obs",
    method = "pearson")
[1] 0.4102269

Two numerical variables: correlation (or with stats::cor.test)

  • Here we compute the Pearson rho (parametric), with the function cor.test (the same we used for testing paired samples)
    • implicitely takes care on NAs
# using `stats` package 
cor_test_result <- cor.test(x = nhanes$height, y =  nhanes$weight, 
                            method = "pearson")

# looking at the cor estimate
cor_test_result[["estimate"]][["cor"]]
[1] 0.4102269
  • The function ggpubr::ggscatter gives us all in one (scatter plot + \(r\) (“R”))
library("ggpubr") # 'ggplot2' Based Publication Ready Plots
ggpubr::ggscatter(nhanes, x = "height", y = "weight", 
                  cor.coef = TRUE, cor.method = "pearson", #cor.coef.coord = 2,
                  xlab = "Height (in)", ylab = "Weight (lb)")

Two numerical variables: correlation (or with stats::cor.test)

Spearman rank-order correlation

The Spearman’s rank-order correlation is the nonparametric version of the Pearson correlation.

Spearman’s correlation coefficient, (\(ρ\), also signified by \(rs\)) measures the strength and direction of association between two ranked variables.

  • used when 2 variables have a non-linear relationship
  • excellent for ordinal data (when Pearson’s is not appropriate), i.e. Likert scale items

To compute it, we simply calculate Pearson’s correlation of the rankings of the raw data (instead of the data).

Spearman rank-order correlation (example)

Let’s say we want to get Spearman’s correlation with ordinal factors Education and HealthGen in the NHANES sample. I have to convert them to their underlying numeric code, to compare rankings.

tabyl(nhanes$education)
 nhanes$education   n percent valid_percent
        8th Grade  32   0.064    0.06412826
   9 - 11th Grade  68   0.136    0.13627255
     College Grad 157   0.314    0.31462926
      High School  94   0.188    0.18837675
     Some College 148   0.296    0.29659319
             <NA>   1   0.002            NA
tabyl(nhanes$health_gen)
 nhanes$health_gen   n percent valid_percent
         Excellent  47   0.094    0.10444444
              Fair  53   0.106    0.11777778
              Good 177   0.354    0.39333333
              Poor  11   0.022    0.02444444
             Vgood 162   0.324    0.36000000
              <NA>  50   0.100            NA
nhanes <- nhanes %>% 
  # reorder education
  mutate (edu_ord = factor (education, 
                            levels = c("8th Grade", "9 - 11th Grade",
                                       "High School", "Some College",
                                       "College Grad" , NA))) %>%  
  # create edu_rank 
  mutate (edu_rank = as.numeric(edu_ord)) %>% 
  # reorder health education
  mutate (health_ord = factor (health_gen, 
                            levels = c( NA, "Poor", "Fair",
                                       "Good", "Vgood",
                                       "Excellent"))) %>%
  # create health_rank 
  mutate (health_rank = as.numeric(health_ord)) 

table(nhanes$edu_ord, useNA = "ifany" )

     8th Grade 9 - 11th Grade    High School   Some College   College Grad 
            32             68             94            148            157 
          <NA> 
             1 
table(nhanes$edu_rank, useNA = "ifany" )

   1    2    3    4    5 <NA> 
  32   68   94  148  157    1 
table(nhanes$health_ord, useNA = "ifany" )

     Poor      Fair      Good     Vgood Excellent      <NA> 
       11        53       177       162        47        50 
table(nhanes$health_rank,  useNA = "ifany" )

   1    2    3    4    5 <NA> 
  11   53  177  162   47   50 

Spearman rank-order correlation (example cont.)

Now we can actually compute it

  • After setting up the variables correctly (as.numeric), just specify method = "spearman"
# using `stats` package 
cor_test_result_sp <- cor.test(x = nhanes$edu_rank,
                               y = nhanes$health_rank, 
                               method = "spearman", 
                               exact = FALSE) # removes the Ties message warning 
# looking at the cor estimate
cor_test_result_sp

    Spearman's rank correlation rho

data:  nhanes$edu_rank and nhanes$health_rank
S = 10641203, p-value = 1.915e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2946493 
#cor_test_result_sp[["estimate"]][["rho"]]

TWO CATEGORICAL VARIABLES (FAMuSS)

Two categorical variables (plot)

In the famuss dataset, the variables race, and actn3.r577x are categorical variables.

## genotypes as columns
genotype.race = matrix(table(famuss$actn3.r577x, famuss$race), ncol=3, byrow=T)
colnames(genotype.race) = c("CC", "CT", "TT")
rownames(genotype.race) = c("African Am", "Asian", "Caucasian", "Hispanic", "Other")

barplot(genotype.race, col = colors[c(7, 4, 1, 2, 3)], ylim=c(0,300), width=2)
legend("topright", inset=c(.05, 0), fill=colors[c(7, 4, 1, 2, 3)], 
       legend=rownames(genotype.race))

Two categorical variables (contingency table)

Specifically, the variable actn3.r577x takes on three possible levels (CC, CT, or TT) which indicate the distribution of genotype at location r577x on the ACTN3 gene for the FAMuSS study participants.

A contingency table summarizes data for two categorical variables.

  • the function stats::addmargins puts arbitrary Margins on Multidimensional Tables
    • The extra column & row "Sum" provide the marginal totals across each row and each column, respectively
# levels of actn3.r577x
table(famuss$actn3.r577x)

 CC  CT  TT 
173 261 161 
# contingency table to summarize race and actn3.r577x
addmargins(table(famuss$race, famuss$actn3.r577x))
            
              CC  CT  TT Sum
  African Am  16   6   5  27
  Asian       21  18  16  55
  Caucasian  125 216 126 467
  Hispanic     4  10   9  23
  Other        7  11   5  23
  Sum        173 261 161 595

Two categorical variables (contingency table prop)

Contingency tables can also be converted to show proportions. Since there are 2 variables, it is necessary to specify whether the proportions are calculated according to the row variable or the column variable. + using the margin = argument in the base::prop.table function (1 indicates rows, 2 indicates columns)

# adding row proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x), margin =  1))
            
                    CC        CT        TT       Sum
  African Am 0.5925926 0.2222222 0.1851852 1.0000000
  Asian      0.3818182 0.3272727 0.2909091 1.0000000
  Caucasian  0.2676660 0.4625268 0.2698073 1.0000000
  Hispanic   0.1739130 0.4347826 0.3913043 1.0000000
  Other      0.3043478 0.4782609 0.2173913 1.0000000
  Sum        1.7203376 1.9250652 1.3545972 5.0000000
# adding column proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x),margin =  2))
            
                     CC         CT         TT        Sum
  African Am 0.09248555 0.02298851 0.03105590 0.14652996
  Asian      0.12138728 0.06896552 0.09937888 0.28973168
  Caucasian  0.72254335 0.82758621 0.78260870 2.33273826
  Hispanic   0.02312139 0.03831418 0.05590062 0.11733618
  Other      0.04046243 0.04214559 0.03105590 0.11366392
  Sum        1.00000000 1.00000000 1.00000000 3.00000000

Chi Squared test of independence

The Chi-squared test is a hypothesis test used to determine whether there is a relationship between two categorical variables.

  • categorical vars. can have nominal or ordinal measurement scale
  • the observed frequencies are compared with the expected frequencies and their deviations are examined.
# Chi-squared test
# (Test of association to see if 
# H0: the 2 cat var (race  & actn3.r577x ) are independent
# H1: the 2 cat var are correlated in __some way__

tab <- table(famuss$race, famuss$actn3.r577x)
test_chi <- chisq.test(tab)

the obtained result (test_chi) is a list of objects…

You try…

…run View(test_chi) to check

Chi Squared test of independence (cont)

Within test_chi results there are:

  • Observed frequencies = how often a combination occurs in our sample
# Observed frequencies
test_chi$observed
            
              CC  CT  TT
  African Am  16   6   5
  Asian       21  18  16
  Caucasian  125 216 126
  Hispanic     4  10   9
  Other        7  11   5
  • Expected frequencies = what would it be if the 2 vars were PERFECTLY INDEPENDENT
# Expected frequencies
round(test_chi$expected  , digits = 1 )
            
                CC    CT    TT
  African Am   7.9  11.8   7.3
  Asian       16.0  24.1  14.9
  Caucasian  135.8 204.9 126.4
  Hispanic     6.7  10.1   6.2
  Other        6.7  10.1   6.2

Chi Squared test of independence (res)

  • Recall that
    • \(H_{0}\): the 2 cat. var. are independent
    • \(H_{1}\): the 2 cat. var. are correlated in some way
  • The result of Chi-Square test represents a comparison of the above two tables (observed v. expected):
    • p-value = 0.01286 smaller than α = 0.05 so we REJECT the null hypothesis
test_chi

    Pearson's Chi-squared test

data:  tab
X-squared = 19.4, df = 8, p-value = 0.01286

Computing Cramer’s V after test of independence (by hand)

Recall that Crammer’s V is a way in which we can measure effect size of the test of independence (i.e. a measure of the strength of association between two nominal variables)

  • V ranges from [0 1] (the smaller V, the lower the correlation)

\[V=\sqrt{\frac{\chi^2}{n(k-1)}} \]

where:

  • \(V\) denotes Cramér’s V
  • \(\chi^2\) is the Pearson chi-square statistic from the aforementioned test;
  • \(n\) is the sample size involved in the test and
  • \(k\) is the lesser number of categories of either variable

Computing Cramer’s V after test of independence (2 ways)

  • ✍🏻 By hand” first to see the steps
# Compute Creamer's V
 
# inputs 
chi_calc <- test_chi$statistic
n <- nrow(famuss) # N of obd 
n_r <- nrow(test_chi$observed) # number of rows in the contingency table
n_c <- ncol(test_chi$observed) # number of columns in the contingency table

# Cramer’s V
sqrt(chi_calc / (n*min(n_r -1, n_c -1)) )
X-squared 
0.1276816 
# Cramer’s V 
rstatix::cramer_v(test_chi$observed)
[1] 0.1276816

A Cramer’s V of 0.12 indicates a relatively weak association between the two categorical variables. It suggests that while there may be some relationship between the variables, it is not particularly strong

Chi Squared test of goodness of fit

In this case, we are conducting a type of Pearson’s Chi-square test

  • used to test whether the observed distribution of a categorical variable differs from your expectations
  • the statistic is based on the discrepancies between observed and expected counts

Chi Squared test of goodness of fit (example)

Since the participants of the FAMuSS study where volunteers at a university, they did not come from a “representative” sample of the US population

  • The \(\chi^{2}\) test can be used to test the \(H_{0}\) that the participants are racially representative of the general population

Race

African.American

Asian

Caucasian

Other

Total

FAMuSS (Observed)

27

55

467

46

595

US Census (Expected)

76.16

5.95

478.38

34.51

595

We use the formula \(\chi^{2} = \sum_{k}\frac{(Observed - Expected)^{2}}{Expected}\),
under \(H_{0}\) = the sample proportions should equal the population proportions.

Chi Squared test of goodness of fit (example)

# Subset the vectors of frequencies from the 2 rows  
observed <- c(27,  55,  467, 46)
expected <- c(76.2,  5.95, 478.38,  34.51)

# Calculate Chi-Square statistic manually 
chi_sq_statistic <- sum((observed - expected)^2 / expected) 
df <- length(observed) - 1 
p_value <- 1 - pchisq(chi_sq_statistic, df) 

# Print results 
chi_sq_statistic
[1] 440.2166
df
[1] 3
p_value 
[1] 0

The calculated \(\chi^{2}\) statistic is very large, and the p_value is close to 0. Hence, There is more than sufficient evidence to reject the null hypothesis that the sample is representative of the general population.
A comparison of the observed and expected values (or the residuals) indicates that the largest discrepancy is with the over-representation of Asian participants.

1 CATEGORICAL & 1 NUMERICAL VARIABLES (???)

FROM CORRELATION/ ASSOCIATION TO CAUSATION

Splitting the sample

If we seek any causal relationship between an explanatory and outcome variable we should split our sample to have:

  1. one sample to “train” a model on
  2. one sample to “test” the model on
  • Otherwise out model will seem better than it is (since it will be specifically built to “fit” our data)

  • The function rsample::initial_split will assist in that

nhanes_split <- rsample::initial_split(nhanes)
# default = 75%  of observations 
nhanes_train <- rsample::training(nhanes_split)
# default = 25%  of observations 
nhanes_test <- rsample::testing(nhanes_split)

Visualize the data

We are mainly looking for a “vaguely” linear shape here

  • ggplot2 gives us a visual confirmation via linear best fit (the least squares regression line), using method = lm, & its 95% CI
ggplot(nhanes_train, aes (x = age, 
                          y = bmi)) + 
  geom_point() + 
  geom_smooth(method = lm,  
              #se = FALSE
              )

Visualize the data

Linear regression models

The lm() function is used to fit linear models has the following generic structure:

lm(y ~ x, data)

where:

  • the 1st argument y ~ x specifies the variables used in the model (here the model regresses a response variable \(y\) against an explanatory variable \(x\).
  • The 2nd argument data is used only when the dataframe name is not already specified in the first argument.

Linear regression models syntax

The following example shows fitting a linear model that predicts BMI from age (in years) using data from nhanes_train adult sample (individuals 21 years of age or older from the NHANES data).

#fitting linear model
lm(nhanes_train$bmi ~ nhanes_train$age)
#equivalently...
lm(bmi ~ age, data = nhanes_train)

Call:
lm(formula = bmi ~ age, data = nhanes_train)

Coefficients:
(Intercept)          age  
   28.64243      0.01265  
  • Running the function creates an object (of class lm) that contains several components (model coefficients, etc), either directly displayed or accessible with summary() notation or specific functions.

Linear regression models syntax

We can save the model and then extract individual output elements from it using the $ syntax

# name the model object
model <- lm(bmi ~ age, data = nhanes_train)

# extract model output elements
model$coefficients
model$residuals
model$fitted.values

The command summary returns these elements

  • Call: reminding the equation used
  • Residuals: a 5 number summary of the Residuals
  • Coefficients: estimates, and hypothesis testing
    • intercept
    • explanatory variable slope

Linear regression models interpretation: coefficients

  • The model tests the null hypothesis that a coefficient is 0
  • Coefficients: estimates, std. error of these estimates, t value and p-value
    • intercept
    • explanatory variable slope
  • In regression, the population parameter of interest is typically the slope parameter
    • here age doesn’t appear significantly ≠ 0
summary(model)

Linear regression models interpretation: coefficients


Call:
lm(formula = bmi ~ age, data = nhanes_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.186  -4.871  -1.250   3.812  39.570 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.64243    1.12495  25.461   <2e-16 ***
age          0.01265    0.02133   0.593    0.554    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.795 on 370 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.0009493, Adjusted R-squared:  -0.001751 
F-statistic: 0.3516 on 1 and 370 DF,  p-value: 0.5536

Linear regression models outputs: fitted values

model$fitted.values
       1        2        3        4        5        6        7        8 
29.43938 29.22433 29.03458 29.31288 29.50263 29.18638 29.30023 29.31288 
       9       10       11       12       13       14       15       16 
29.45203 29.23698 29.65443 29.55323 28.93338 29.27493 29.48998 29.27493 
      17       18       19       20       21       22       23       24 
29.22433 29.07253 29.08518 29.52793 29.21168 29.13578 28.97133 29.05988 
      25       26       27       28       30       31       32       33 
29.42673 29.02193 29.65443 28.93338 29.27493 29.11048 29.35083 29.46468 
      34       35       36       37       38       39       40       41 
29.37613 29.13578 29.11048 29.00928 29.07253 29.22433 29.37613 29.59118 
      42       43       44       45       46       47       48       49 
29.45203 29.42673 29.31288 29.13578 28.90808 29.36348 29.52793 29.37613 
      50       51       52       53       54       55       56       57 
29.00928 29.09783 29.07253 29.33818 29.27493 29.00928 29.21168 29.32553 
      58       59       60       61       62       63       64       65 
29.52793 29.19903 28.94603 29.12313 29.05988 29.45203 29.24963 29.52793 
      66       67       68       69       70       71       72       73 
29.32553 29.47733 29.31288 29.24963 29.26228 28.93338 29.47733 29.37613 
      74       75       76       77       78       79       80       81 
29.11048 29.23698 29.35083 29.54058 29.38878 29.14843 29.65443 29.28758 
      82       83       84       85       86       87       88       89 
29.04723 29.36348 29.52793 28.97133 29.22433 29.40143 29.24963 29.17373 
      90       91       92       93       94       95       96       98 
29.54058 29.19903 29.14843 29.23698 29.35083 28.90808 29.65443 29.23698 
      99      100      102      103      104      105      106      107 
29.13578 29.33818 29.65443 29.55323 29.47733 29.21168 28.95868 29.12313 
     108      109      110      111      112      113      114      115 
29.42673 29.28758 28.90808 29.32553 28.95868 28.99663 29.23698 28.94603 
     116      117      118      119      120      121      122      123 
28.92073 29.65443 29.14843 29.35083 29.21168 29.41408 29.56588 29.14843 
     124      125      126      127      128      129      130      131 
28.97133 28.90808 29.04723 28.90808 28.97133 29.55323 28.97133 29.51528 
     132      133      134      135      136      137      138      139 
28.99663 28.98398 29.36348 29.14843 29.36348 29.32553 29.08518 29.21168 
     140      141      142      143      144      145      146      147 
28.90808 29.30023 29.31288 29.65443 29.54058 28.90808 29.56588 29.23698 
     148      149      150      151      152      153      154      155 
29.57853 29.09783 29.04723 29.54058 29.54058 29.05988 29.36348 29.02193 
     156      157      158      159      160      161      162      163 
29.11048 29.00928 29.16108 28.98398 29.19903 29.50263 29.24963 29.43938 
     164      165      166      167      168      169      170      171 
29.40143 29.04723 28.98398 28.95868 29.65443 28.97133 29.42673 29.35083 
     172      173      174      175      176      177      178      179 
29.60383 29.02193 29.00928 29.47733 29.65443 29.57853 29.35083 29.60383 
     180      181      182      183      184      185      186      187 
29.17373 29.04723 29.26228 29.60383 29.33818 29.24963 29.65443 29.27493 
     188      189      190      191      192      193      194      195 
29.30023 29.32553 28.93338 29.46468 29.30023 29.24963 29.19903 29.00928 
     196      197      198      199      200      201      202      203 
29.57853 28.90808 29.45203 29.03458 29.18638 29.52793 29.24963 29.18638 
     204      205      206      207      208      209      210      211 
29.24963 29.64178 29.45203 29.57853 29.05988 28.99663 29.41408 29.31288 
     212      213      214      215      216      217      218      219 
29.64178 29.19903 28.94603 29.56588 29.65443 29.31288 29.11048 29.41408 
     220      221      222      223      224      225      226      227 
29.22433 29.47733 29.60383 29.31288 29.19903 29.52793 29.22433 29.13578 
     228      229      230      231      232      233      234      235 
29.32553 29.16108 29.40143 29.23698 29.46468 29.31288 29.12313 29.41408 
     236      237      238      239      240      241      242      243 
29.45203 28.99663 29.36348 29.27493 29.26228 29.45203 29.37613 29.36348 
     244      245      246      247      248      249      250      251 
29.60383 29.14843 29.43938 29.60383 29.54058 29.36348 29.33818 29.09783 
     252      253      254      255      256      257      258      259 
29.16108 29.36348 29.21168 29.37613 29.08518 29.32553 29.42673 29.61648 
     260      261      262      263      264      265      266      267 
29.27493 29.17373 29.55323 29.50263 29.46468 29.32553 29.24963 29.65443 
     268      269      270      271      272      273      274      275 
28.92073 29.50263 29.13578 29.18638 29.45203 29.05988 29.24963 29.33818 
     276      277      278      279      280      281      282      283 
29.03458 29.23698 29.24963 29.65443 29.50263 29.23698 29.02193 29.30023 
     284      285      286      287      288      289      290      291 
28.97133 29.38878 29.35083 29.60383 29.36348 29.54058 29.07253 29.22433 
     292      293      294      295      296      297      298      299 
29.59118 28.94603 28.98398 29.07253 29.02193 29.27493 29.23698 29.57853 
     300      301      302      303      304      305      306      307 
29.45203 29.35083 29.32553 29.21168 29.54058 29.52793 29.14843 29.30023 
     308      309      310      311      312      313      314      315 
29.12313 29.46468 29.19903 29.04723 29.13578 29.00928 29.50263 28.93338 
     316      317      318      319      320      321      322      323 
29.09783 29.27493 29.30023 29.48998 29.16108 29.24963 29.11048 29.33818 
     324      325      326      327      328      329      330      331 
29.11048 29.28758 29.03458 29.04723 29.62913 29.42673 29.55323 29.13578 
     332      333      334      335      336      337      338      339 
29.48998 28.97133 29.23698 29.04723 29.46468 29.11048 29.40143 29.31288 
     340      341      342      343      344      345      346      347 
29.40143 29.35083 29.56588 29.08518 28.97133 29.13578 29.18638 29.41408 
     348      349      350      351      352      353      354      355 
29.12313 29.37613 29.45203 29.30023 29.65443 29.46468 29.03458 29.48998 
     356      357      358      359      360      361      362      363 
28.94603 29.36348 29.27493 29.30023 28.92073 28.90808 29.57853 28.99663 
     364      365      366      367      368      369      370      371 
29.60383 29.65443 29.55323 28.95868 29.28758 29.12313 29.35083 28.90808 
     372      373      374      375 
28.92073 29.26228 29.36348 29.24963 

Linear regression models outputs: residuals

model$residuals 
            1             2             3             4             5 
 3.760623e+00 -2.224328e+00 -2.184579e+00 -3.012878e+00  3.097373e+00 
            6             7             8             9            10 
 1.813622e+00 -3.800228e+00  3.727122e+00 -1.272027e+00  7.563022e+00 
           11            12            13            14            15 
 6.557385e-02 -2.532266e-01 -1.023338e+01 -7.274928e+00 -5.009977e+00 
           16            17            18            19            20 
 1.332507e+01 -1.724328e+00 -5.372529e+00 -8.885179e+00  6.572073e+00 
           21            22            23            24            25 
-1.070168e+01 -2.425778e+00  2.928671e+00 -5.479879e+00  9.193273e+00 
           26            27            28            30            31 
 6.318071e+00  2.145574e+00 -6.933793e-01 -3.874928e+00  7.895214e-01 
           32            33            34            35            36 
-4.850827e+00 -2.014677e+00  9.033873e+00 -7.325778e+00 -2.550479e+00 
           37            38            39            40            41 
-8.669279e+00  7.127471e+00  3.475672e+00  3.123873e+00 -4.181176e+00 
           42            43            44            45            46 
-5.252027e+00  4.273273e+00  2.987122e+00  3.514222e+00 -8.208079e+00 
           47            48            49            50            51 
-2.334774e-01  7.072073e+00  3.523873e+00 -2.889279e+00  2.802171e+00 
           52            53            54            55            56 
-8.472529e+00 -3.338178e+00 -1.374928e+00 -4.999279e+00  8.588322e+00 
           57            58            59            60            61 
 1.574472e+00 -9.279267e-01  7.180972e+00 -4.646029e+00  7.856872e+00 
           62            63            64            65            66 
-1.239879e+00 -3.052027e+00  6.180372e+00 -5.327927e+00 -6.415528e+00 
           67            68            69            70            71 
-8.157327e+00  4.787122e+00 -2.069628e+00 -3.622779e-01 -7.653379e+00 
           72            73            74            75            76 
-3.077327e+00 -3.276127e+00 -7.150479e+00  5.093022e+00 -2.050827e+00 
           77            78            79            80            81 
 1.459942e+01 -7.288777e+00  8.641572e+00 -4.904426e+00 -8.047578e+00 
           82            83            84            85            86 
-8.747229e+00  1.906523e+00  4.722073e+00  2.528671e+00  2.756719e-01 
           87            88            89            90            91 
-3.142727e-02  9.250372e+00 -2.663728e+00 -3.000577e+00  1.520972e+00 
           92            93            94            95            96 
-3.748428e+00  1.663022e+00 -5.630827e+00 -6.048079e+00 -2.844262e-01 
           98            99           100           102           103 
 1.630220e-01  4.164222e+00 -6.248178e+00 -5.274426e+00 -3.823227e+00 
          104           105           106           107           108 
 5.812673e+00 -7.221678e+00  3.141321e+00 -1.843128e+00 -3.666727e+00 
          109           110           111           112           113 
-9.087578e+00 -6.080794e-01  1.837447e+01 -3.458679e+00  4.033709e-01 
          114           115           116           117           118 
 1.863022e+00  1.992397e+01  8.189271e+00  1.135574e+00 -1.584284e-01 
          119           120           121           122           123 
 1.539173e+00  8.588322e+00 -1.714077e+00 -1.765877e+00 -4.238428e+00 
          124           125           126           127           128 
 3.528671e+00 -5.808079e+00 -5.747229e+00 -9.738079e+00 -6.161329e+00 
          129           130           131           132           133 
-2.873227e+00  6.268671e+00 -3.895277e+00  6.983371e+00  1.541602e+01 
          134           135           136           137           138 
 4.126523e+00  4.051572e+00  3.665226e-01 -1.125528e+00 -2.125179e+00 
          139           140           141           142           143 
 4.188322e+00 -7.218079e+00  8.997723e-01 -8.382878e+00 -2.804426e+00 
          144           145           146           147           148 
 5.394233e-01  4.331921e+00 -7.365877e+00 -6.926978e+00  1.214735e-01 
          149           150           151           152           153 
-6.897829e+00  6.527712e-01  1.594233e-01  5.589423e+00 -1.259879e+00 
          154           155           156           157           158 
-9.943477e+00  1.111807e+01 -1.610479e+00 -5.109279e+00 -6.871078e+00 
          159           160           161           162           163 
-4.423979e+00  7.800972e+00 -4.302627e+00 -4.349628e+00 -3.169377e+00 
          164           165           166           167           168 
-6.521427e+00 -4.617229e+00  5.086021e+00 -7.758679e+00 -9.844262e-01 
          169           170           171           172           173 
-2.871329e+00 -8.666727e+00 -9.800827e+00  4.961736e-01  7.598071e+00 
          174           175           176           177           178 
 5.690721e+00  3.852673e+00  1.045574e+00 -4.378526e+00  6.949173e+00 
          179           180           181           182           183 
-8.193826e+00  3.326272e+00  2.427712e-01 -8.262278e+00  9.961736e-01 
          184           185           186           187           188 
-2.381775e-01 -2.149628e+00 -2.054426e+00 -1.874928e+00  9.539772e+00 
          189           190           191           192           193 
 1.037447e+01  1.776662e+01 -5.044677e+00 -2.250228e+00 -9.019628e+00 
          194           195           196           197           198 
 9.718385e-04 -3.909279e+00 -6.748526e+00 -2.698079e+00 -1.520270e-01 
          199           200           201           202           203 
-9.034579e+00 -6.396378e+00  1.154207e+01  5.950372e+00  5.436218e-01 
          204           205           206           207           208 
 1.043037e+01 -6.981776e+00  1.987973e+00  1.039147e+01  3.957012e+01 
          209           210           211           212           213 
 4.403371e+00  9.295923e+00  1.217712e+01 -6.451776e+00 -1.899028e+00 
          214           215           216           217           218 
-5.526029e+00  2.534123e+00  4.865574e+00 -3.012878e+00 -5.570479e+00 
          219           220           221           222           223 
 9.592279e-02 -3.874328e+00 -4.807327e+00 -2.803826e+00  4.887122e+00 
          224           225           226           227           228 
 5.970972e+00  9.872073e+00  5.175672e+00 -1.052578e+01  4.474472e+00 
          229           230           231           232           233 
-6.871078e+00 -5.814273e-01 -4.369780e-01 -3.864677e+00  2.451712e+01 
          234           235           236           237           238 
 2.776872e+00  2.285923e+00 -2.852027e+00 -9.866291e-01  5.236523e+00 
          239           240           241           242           243 
-5.694928e+00 -7.122278e+00 -6.152027e+00 -1.118613e+01 -3.763477e+00 
          244           245           246           247           248 
 1.761736e-01 -9.548428e+00 -5.739377e+00  1.379617e+01  1.959423e+00 
          249           250           251           252           253 
-6.163477e+00  6.061822e+00  1.902171e+00 -9.610783e-01 -6.363477e+00 
          254           255           256           257           258 
-3.111678e+00 -1.516127e+00  5.482134e-02 -8.295528e+00  1.632728e-01 
          259           260           261           262           263 
 5.883524e+00 -7.274928e+00  3.326272e+00  3.467734e-01 -3.602627e+00 
          264           265           266           267           268 
 2.645323e+00 -1.125528e+00 -5.649628e+00 -6.054426e+00 -4.700729e+00 
          269           270           271           272           273 
 1.176737e+01  4.284222e+00 -9.786378e+00 -4.522027e+00 -1.459879e+00 
          274           275           276           277           278 
-1.449628e+00 -2.338178e+00 -3.274579e+00 -3.436978e+00 -3.599628e+00 
          279           280           281           282           283 
-4.854426e+00 -2.302627e+00  9.323022e+00 -1.521929e+00  4.499772e+00 
          284           285           286           287           288 
-8.371329e+00  4.581223e+00 -9.810827e+00  2.336174e+00 -3.263477e+00 
          289           290           291           292           293 
-1.950577e+00  1.160747e+01 -3.374328e+00 -1.991176e+00 -1.114603e+01 
          294           295           296           297           298 
-2.383979e+00 -6.072529e+00 -8.551929e+00  7.315072e+00  1.080302e+01 
          299           300           301           302           303 
-7.285265e-01  5.997973e+00 -6.150827e+00 -8.625528e+00 -3.521678e+00 
          304           305           306           307           308 
 1.865942e+01 -3.747927e+00 -8.248428e+00 -1.900228e+00 -2.312850e-02 
          309           310           311           312           313 
-7.984677e+00  1.148097e+01 -6.277229e+00 -2.995778e+00 -2.509279e+00 
          314           315           316           317           318 
 3.117373e+00 -4.933379e+00 -9.978286e-01  2.535072e+00  5.497723e-01 
          319           320           321           322           323 
 1.278002e+01 -1.411078e+00 -6.409628e+00  4.799521e+00 -6.318178e+00 
          324           325           326           327           328 
 3.479952e+01  2.512422e+00  1.256542e+01  6.352771e+00 -5.029126e+00 
          329           330           331           332           333 
 4.273273e+00  1.044677e+01  6.964222e+00  2.860023e+00  3.028671e+00 
          334           335           336           337           338 
-7.696978e+00 -2.472288e-01 -1.464677e+00  1.098952e+01 -9.014273e-01 
          339           340           341           342           343 
 1.168712e+01  3.798573e+00  4.749173e+00 -3.395877e+00  3.664821e+00 
          344           345           346           347           348 
-2.711329e+00  5.094222e+00 -6.566378e+00  1.188592e+01  8.516872e+00 
          349           350           351           352           353 
-4.866127e+00 -1.952027e+00 -8.070228e+00 -1.024426e+00 -3.024677e+00 
          354           355           356           357           358 
 2.675421e+00  2.000231e-01  1.045397e+01 -2.863477e+00  9.650722e-01 
          359           360           361           362           363 
 7.499772e+00 -6.120729e+00 -2.080794e-01 -3.398526e+00 -6.696629e+00 
          364           365           366           367           368 
-9.333826e+00  4.225574e+00 -2.873227e+00  1.185132e+01 -4.887578e+00 
          369           370           371           372           373 
 2.576872e+00  8.849173e+00 -6.508079e+00 -1.660729e+00  3.127722e+00 
          374           375 
-7.863477e+00  8.750372e+00 

DIAGNOSTIC PLOTS

The following plots help us assessing that (some of) the assumptions of linear regression are met!

(the independence assumption is more linked to the study design than to the data used in modeling)

Linear regression diagnostic plots: residuals v. fitted values 1/4

ASSUMPTION 1: there exists a linear relationship between the independent variable, x, and the dependent variable, y

For an observation \((x_i, y_i)\), where \(\hat{y}_i\) is the predicted value according to the line \(\hat{y} = b_0 + b_1x\), the residual is the value \(e_i = y_i - \hat{y}_i\)

  • A linear model is a particularly good fit for the data when the residual plot shows random scatter above and below the horizontal line.
    • (In this R plot, we look for a red line that is fairly straight)
# residual plot
plot(model, which = 1 )

Linear regression diagnostic plots: residuals v. fitted values 1/4

Linear regression diagnostic plots: normality of residuals 2/4

ASSUMPTION 2: The residuals of the model are normally distributed

With the quantile-quantile plot (Q_Q) we can checking normality of the residuals.

# quantile-quantile plot
plot(model, which = 2 )

Linear regression diagnostic plots: normality of residuals 2/4

The data appear roughly normal, but there are deviations from normality in the tails, particularly the upper tail.

Linear regression diagnostic plots: Homoscedasticity 3/4

ASSUMPTION 3: The residuals have constant variance at every level of x (“homoscedasticity”)

This one is called a Spread-location plot: shows if residuals are spread equally along the ranges of predictors

# Spread-location plot
plot(model, which = 3 )

Linear regression diagnostic plots: Homoscedasticity 3/4

Linear regression diagnostic plots: leverage 4/4

This last diagnostic plot is actually not referred to any assumptions but has to do with outliers:

  • a residuals vs. leverage plot allows us to identify influential observations in a regression model
    • The x-axis shows the “leverage” of each point and the y-axis shows the “standardized residual of each point”, i.e. “How much would the coefficients in the regression model would change if a particular observation was removed from the dataset?”
    • Cook’s distance lines (red dashed lines) – not visible here – appear on the corners of the plot when there are influential cases
plot(model, which = 5 )

Linear regression diagnostic plots: leverage 4/4

In this particular case, there is no influential case, or cases